Abstract. In this work, we consider the Combinatorial RNA Design problem, a minimal instance of 
the RNA design problem which aims at finding a sequence that admits a given target as its unique 
base pair maximizing structure. We provide complete characterizations for the structures that can be 
designed using restricted alphabets. 

Under a classic four-letter alphabet, we provide a complete characterization of designable structures 
without unpaired bases. When unpaired bases are allowed, we provide partial characterizations for 
classes of designable/undesignable structures, and show that the class of designable structures is closed 
under the stutter operation. Membership of a given structure to any of the classes can be tested in 
linear time and, for positive instances, a solution sequence can also be generated in linear time. 
Finally, we consider a structure-approximating version of the problem that allows bands (helices) to be 
extended and, assuming that the input structure avoids two motifs, we provide a linear-time algorithm 
that produces a designable structure with at most twice as many base pairs as the input structure. 


1 Introduction 

RiboNucleic Acids (RNAs) are biomolecules which act in almost every aspect of cellular life, and 
can be abstracted as a sequence of nucleotides, i.e., a string over the alphabet {A, U, C, G}. Due 
to their versatility, and the specificity of their interactions, they are increasingly being used as 
therapeutic agents [?], and as building blocks for the emerging field of synthetic biology [?]. 
A substantial proportion of the functional roles played by RNA rely on interactions with other 
molecules to activate/repress dynamical properties of some biological system, and ultimately require 
the adoption of a specific conformation. Accordingly, RNA bioinformatics has dedicated much effort 
to developing energy models [?] and algorithms [?] to predict the secondary structure of RNA, 
a combinatorial description of the conformation adopted by an RNA which only retains interacting 
positions, or base pairs. Historically, structure prediction has been addressed as an optimization 
problem, whose expected output is a secondary structure which minimizes some notion of free 
energy. The performance of RNA folding prediction has now reached a point 
where in silico predictions have become generally reliable [?], allowing for large scale studies and 
fueling the discovery of an increasing number of functional families [?]. 

Due to the existence of expressive, yet tractable, energy models, coupled with promising 
applications in multiple fields (pharmaceutical research, natural computing, biochemistry...), a wide 
array of computational methods [?] have been proposed to tackle 
the natural inverse version of structure prediction, the RNA design problem. In this problem, 
one attempts to perform the in silico synthesis of artificial RNA sequences, performing a predefined 
biological function in vitro or in vivo. Given the prevalence of structure in the function of an RNA, 
one of the foremost goals of RNA design (sometimes named inverse folding in the literature) is that 
the designed sequence should fold into a predefined secondary structure. In other words, it should 
not be challenged by alternative stable structures having similar or lower free-energy. 

Despite a rich, fast-growing, body of literature dedicated to the problem, there is currently no 
exact polynomial-time algorithm for the problem. Moreover, the complexity of the problem remains 
open (see Section [?] for details). It can be argued that this situation, quite exceptional in the field of 
computational biology, partly stems from the intricacies of the Turner free-energy model [20] which 
associates experimentally-determined energy contributions to ~ 2.4 x 10^ structure/sequence motifs. 
This motivates a reductionist approach, where one studies an idealized version of the RNA design 
problem, lending itself to algorithmic intuitions, while hopefully retaining the presumed difficulty 
of the original problem. 

Fig. 1. Four equivalent representations for an RNA secondary structure of length 68, consisting of 20 base pairs 
forming 7 bands: outer-planar graph (a.), arc-annotated representation (b.), parenthesized expression (c.), and tree 
representation (d.) 

In this work, we introduce the Combinatorial RNA Design problem, a minimal instance of the 
RNA design problem which aims at finding a sequence that admits the target structure as its 
unique base pair maximizing structure. After this short introduction, Sec. 2 states definitions and 
problems. In Sec. 3 we summarize our results, some of which are proven in Sec. 4. Finally, we 
conclude with some remarks, open problems and future extensions of this work. 

2 Definitions and notations 

RNA secondary structure. An RNA can be encoded as a sequence of nucleotides, i.e., a string 
$w = w_1 \cdots w_{|w|} \in \{A, U, C, G\}^*$. The prefix of $w$ of length $i$ is denoted as $w_{[1,i]}$, and $|w|_b$ denotes the 
number of occurrences of $b$ in $w$. A (pseudoknot-free) secondary structure $S$ on an RNA of length $n$ 
is a pair $(n,P)$, where $P$ is a set of base pairs $\{(l_i,r_i)\}_{i=1}^{p} \subset [1,n]^2$ such that: 

- $\forall i \in [1,p]$, $l_i < r_i$; 

- $\forall i \neq j \in [1,p]$, $l_i \neq l_j$, $l_i \neq r_j$, $r_i \neq r_j$ (each position is involved in at most one base pair); 

- $\nexists\, i,j \in [1,p]$, $l_i < l_j < r_i < r_j$ (base pairs $(l_i,r_i)$ and $(l_j,r_j)$ do not cross). 

The set of all secondary structures is denoted by $\mathcal{S}$, and its restriction to structures of length $n$ by 
$\mathcal{S}_n$. The unpaired positions $U_S$ in a secondary structure $S = (n,P)$ is the set of indices $k \in [1,n]$ 
that are not involved in a base pair. A structure $S$ is saturated if $U_S = \emptyset$. Given a sequence 
$w$ and a structure $S = (|w|,P)$, let $u_i = \varepsilon$ if $i \in U_S$ and $u_i = w_i$ otherwise, where $\varepsilon$ is the 
empty sequence. Define the $S$-paired restriction of $w$ (paired restriction of $S$), denoted as Paired$(w, S)$ 
(Paired$(S)$), as $u_1 \cdots u_{|w|}$ (respectively, $\{(|u_1 \cdots u_i|, |u_1 \cdots u_j|) \mid (i,j) \in P\}$). A maximal subset 
$B = \{(i,j),(i+1,j-1),\ldots,(i+\ell, j-\ell)\}$ of $P$, for some integers $i, j, \ell$, is called a band (sometimes 
referred to as helix or stem) of size $|B|$ of $S = (n,P)$. Note that every base pair belongs to 
exactly one band. 







Dot-parentheses notation. A well-parenthesized sequence $s \in \{(, ), .\}^*$ can be used to represent a 
secondary structure. There is a one-to-one correspondence between secondary structures and such 
well-parenthesized sequences: any base pair $(l,r)$ of $S$ becomes a pair of corresponding opening and 
closing parentheses in $s$ at positions $l$ and $r$ respectively ($s_l$ = ( and $s_r$ = )), and any unpaired 
position $i$ corresponds to a dot ($s_i$ = .). 

k-stutter. The $k$-stutter of a sequence $s$, denoted by $s^{[k]}$, is the result of copying each of the 
characters in $s$ independently $k$ times. This operation can be applied to both RNA sequences and structures 
in the dot-parentheses notation. 
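
As a small illustration of the stutter operation, the following Python sketch repeats every character of a string in place. The function name `stutter` is introduced here for illustration only and is not part of the original text.

```python
def stutter(s: str, k: int) -> str:
    """Return the k-stutter of s: each character is repeated k times in place."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return "".join(ch * k for ch in s)


# The same operation applies to structures and to sequences.
print(stutter("(.)", 2))  # ((..))
print(stutter("GAC", 3))  # GGGAAACCC
```

Note that stuttering a well-parenthesized string yields a well-parenthesized string, since each opening parenthesis and its partner are both replaced by runs of length k.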

Tree representation. Alternatively, the tree representation, denoted by $T_S$, for $S = (n, P)$ is a rooted 
ordered tree whose vertex set $V_S$ consists of intervals $[l, r]$ for any base pair $(l, r) \in P$, and $[k, k]$ for 
every $k \in U_S$. A virtual root $[0, n+1]$ is added for convenience. Each $[k,k]$ node is called an unpaired 
node, all other nodes (including the root) are called paired nodes. The children of an interval $I \in V_S$ 
are the maximal proper subintervals $I' \in V_S$ of $I$, ordered by the left points of the intervals. The 
degree of a vertex $I \in V_S$ is the total number of its paired neighbors, including its parent (if any). 
We denote by $D(S)$ the maximal degree of nodes in $T_S$. 
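
The degree statistic D(S) can be computed in a single left-to-right pass over the dot-parentheses string. The sketch below is one possible reading of the definition, under the assumption that the virtual root counts as the (paired) parent of every top-level node; the function name `max_degree` is hypothetical.

```python
def max_degree(db: str) -> int:
    """Maximal degree D(S) of the tree of a dot-parentheses string.

    Assumption: the virtual root counts as a paired parent, so a non-root
    paired node has degree (#paired children + 1) and an unpaired node has
    degree 1. The root itself has degree (#paired children).
    """
    paired_children = [0]  # index 0 is the virtual root
    stack = [0]            # ids of the currently open nodes
    has_unpaired = False
    for ch in db:
        if ch == "(":
            paired_children[stack[-1]] += 1
            paired_children.append(0)
            stack.append(len(paired_children) - 1)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif ch == ".":
            has_unpaired = True
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    degrees = [paired_children[0]] + [c + 1 for c in paired_children[1:]]
    if has_unpaired:
        degrees.append(1)
    return max(degrees)


print(max_degree("(()())"))  # 3: the outer pair has two paired children and a parent
```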

Proper, greedy and separated coloring of the tree representation. Consider the tree representation 
Ts of structure S. Color every paired node of Ts different from the root by black, white or grey 
color. This coloring is called proper if: 

1. every node has at most one black, at most one white and at most two grey children; 

2. a node with color c has at most one child with color c; 

3. a black node does not have a white child and a white node does not have a black child. 

A greedy coloring of Ts is the coloring obtained by recursive application of the following rule starting 
from the root and continuing towards leaves: if the node is black, color the first paired child black 
and the remaining paired children grey, if the node is white, color the first paired child white and 
the remaining paired children grey, otherwise (the grey node or the root), color the first paired child 
black, second white and the remaining paired children grey. It is easy to check that if the degree of 
each node is at most four then the greedy coloring is a proper coloring. 

Given a proper coloring of Ts, let the level of each node be the number of black nodes minus the 
number of white nodes on the path from this node to the root. A proper coloring is called separated 
if the two sets of levels, associated with grey and unpaired nodes respectively, do not overlap. 
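
The definitions above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the helper names (`build_tree`, `greedy_coloring`, `is_proper`, `is_separated`, `analyse`) are hypothetical, the input is assumed to be a well-parenthesized string, and the level of a node is assumed to count the node's own color.

```python
def build_tree(db):
    """Tree of a dot-parentheses string; node 0 is the virtual root.
    Returns (kind, children) with kind[v] in {"root", "paired", "unpaired"}."""
    kind, children, stack = ["root"], [[]], [0]
    for ch in db:
        if ch == ")":
            stack.pop()
            continue
        kind.append("paired" if ch == "(" else "unpaired")
        children.append([])
        v = len(kind) - 1
        children[stack[-1]].append(v)
        if ch == "(":
            stack.append(v)
    return kind, children


def greedy_coloring(kind, children):
    """Apply the greedy rule top-down; returns (color, level) indexed by node."""
    color = [None] * len(kind)
    level = [0] * len(kind)
    todo = [0]
    while todo:
        v = todo.pop()
        # Colors offered to the first paired children; the remaining ones are grey.
        palette = {"black": ["black"], "white": ["white"]}.get(color[v], ["black", "white"])
        rank = 0
        for c in children[v]:
            if kind[c] == "paired":
                color[c] = palette[rank] if rank < len(palette) else "grey"
                rank += 1
            level[c] = level[v] + (color[c] == "black") - (color[c] == "white")
            todo.append(c)
    return color, level


def is_proper(kind, children, color):
    for v in range(len(kind)):
        cols = [color[c] for c in children[v] if kind[c] == "paired"]
        if cols.count("black") > 1 or cols.count("white") > 1 or cols.count("grey") > 2:
            return False
        if color[v] == "grey" and cols.count("grey") > 1:
            return False
        if (color[v] == "black" and "white" in cols) or (color[v] == "white" and "black" in cols):
            return False
    return True


def is_separated(kind, color, level):
    grey = {level[v] for v in range(len(kind)) if color[v] == "grey"}
    unpaired = {level[v] for v in range(len(kind)) if kind[v] == "unpaired"}
    return grey.isdisjoint(unpaired)


def analyse(db):
    """Return (proper, separated) for the greedy coloring of db."""
    kind, children = build_tree(db)
    color, level = greedy_coloring(kind, children)
    return is_proper(kind, children, color), is_separated(kind, color, level)


print(analyse("()()(.)"))  # (True, False): a grey node and an unpaired node share level 0
```

The sketch only inspects the greedy coloring; a structure whose greedy coloring is not separated may still admit another proper coloring that is.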

2.1 Statement of the generic RNA design problem 

Consider an energy model $\mathcal{M}$, which associates a free-energy $E_{\mathcal{M}}(w,S) \in \mathbb{R} \cup \{+\infty\}$ to each 
secondary structure $S \in \mathcal{S}_{|w|}$ for a given RNA sequence $w$. The minimum free-energy (MFE) structure 
prediction problem is typically defined as follows: 

RNA-FOLD$_{\mathcal{M}}$ problem 

Input: RNA sequence $w$ 

Output: $S^\star(w) := \operatorname{argmin}_{S' \in \mathcal{S}_{|w|}} E_{\mathcal{M}}(w,S')$. 

The existence of competing structures, having comparable or lower free-energy for a given RNA, 
impacts the well-definedness of the folding process. The detection of such situations is therefore of 
interest, and can be rephrased as follows: 



[Figure 2 panels: a. Target sec. str. S; b. Invalid sequence for S; c. Design for S] 

Fig. 2. The combinatorial RNA design problem: 
Starting from a secondary structure S (a.), our goal 
is to design an RNA sequence which uniquely folds, 
with maximum number of base pairs, into S. The 
sequence proposed in b. is invalid due to the existence 
of an alternative structure (lower half-plane, 
red) having the same number of base pairs as S. 
The right-most sequence (c.) is a design for S. 

UNIQUE-FOLD$_{\mathcal{M}}$ problem 

Input: Sequence $w$ + Energy distance $\Delta > 0$ 

Output: True if, for every $S' \in \mathcal{S}_{|w|} \setminus \{S^\star(w)\}$, $E_{\mathcal{M}}(w, S') \geq E_{\mathcal{M}}(w, S^\star(w)) + \Delta$. 

False otherwise. 

We can now define the combinatorial RNA Design problem as: 

RNA-DESIGN$_{\mathcal{M},\Sigma}$ problem 

Input: Secondary structure $S$ + Energy distance $\Delta > 0$ 

Output: RNA sequence $w \in \Sigma^*$ (called an $(\mathcal{M}, \Sigma, \Delta)$-design for $S$) such that: 

RNA-FOLD$_{\mathcal{M}}(w) = S$ and UNIQUE-FOLD$_{\mathcal{M}}(w, \Delta)$, 
or $\emptyset$ if no such sequence exists. 

Structures for which there exists an $(\mathcal{M}, \Sigma, \Delta)$-design are called $(\mathcal{M}, \Sigma, \Delta)$-designable. Let 
Designable$(\mathcal{M}, \Sigma, \Delta)$ be the set of all such structures. If it is clear from the context, we will usually 
drop $\mathcal{M}$, $\Sigma$ and/or $\Delta$. 


2.2 Combinatorial design in a simple base pair energy model 

In this work, we adopt a Watson-Crick energy model $\mathcal{W}$, which only allows pairs involving 
complementary letters, i.e., in {C, G} and {A, U}. 

Definition 1 (Watson-Crick energy model $\mathcal{W}$). 

$$E_{\mathcal{W}}(w,S) = \begin{cases} -|S| & \text{if } \forall (l,r) \in S,\ w_l \text{ is complementary with } w_r, \\ +\infty & \text{otherwise.} \end{cases}$$ 

We say that the structure $S$ is compatible with a sequence $w$ if $E_{\mathcal{W}}(w, S) < +\infty$. 

Minimizing $E_{\mathcal{W}}(w,S)$ is equivalent to maximizing $|S|$, thus RNA-FOLD$_{\mathcal{W}}$ is a classic base pair 
maximization problem. It can be solved by dynamic programming, historically in $O(n^3)$ time 
[?], with a current best time complexity of $O(n^3/\log(n))$ [?]. A backtracking procedure reconstructs 
the MFE structure, and can be easily adapted to assess the uniqueness of the MFE structure. 
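
To make the base pair maximization and the uniqueness test concrete, here is a minimal cubic-time sketch in Python. It is a textbook-style dynamic program written for this illustration (not the faster algorithm mentioned above), with hypothetical names `fold_count` and `is_design`, and it assumes no minimum distance between paired positions. Instead of backtracking, it counts the structures that reach the optimum, so uniqueness amounts to a count of one.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def fold_count(w):
    """Return (max number of base pairs, number of structures reaching it)."""
    n = len(w)
    # best[i][j] / ways[i][j] describe the half-open interval w[i:j].
    best = [[0] * (n + 1) for _ in range(n + 1)]
    ways = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Case 1: position i is left unpaired.
            b, c = best[i + 1][j], ways[i + 1][j]
            # Case 2: position i pairs with k; inside and outside fold independently.
            for k in range(i + 1, j):
                if (w[i], w[k]) in COMPLEMENTARY:
                    cand = 1 + best[i + 1][k] + best[k + 1][j]
                    cnt = ways[i + 1][k] * ways[k + 1][j]
                    if cand > b:
                        b, c = cand, cnt
                    elif cand == b:
                        c += cnt
            best[i][j], ways[i][j] = b, c
    return best[0][n], ways[0][n]


def is_design(w, db):
    """True if db (same length as w) is the unique pair-maximizing structure of w."""
    stack, pairs = [], 0
    for pos, ch in enumerate(db):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if (w[stack.pop()], w[pos]) not in COMPLEMENTARY:
                return False  # db is not even compatible with w
            pairs += 1
    best, ways = fold_count(w)
    return pairs == best and ways == 1


print(fold_count("AUAU"))       # (2, 2): two optimal structures, so no unique fold
print(is_design("AUUA", "()()"))  # True
```

The case split on the first position of each interval is unambiguous, which is what makes the count equal to the number of distinct optimal structures.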


3 Statement of the results 

We consider the design problem in a base pairing energy model $\mathcal{W}$ restricted to Watson-Crick 
base pairs {C, G} and {A, U}. We set $\Delta = 1$, which forbids a designed sequence from adopting alternative 
structures having at least as many base pairs as the target structure. Let us first 
characterize the sets Designable$(\Sigma)$ of designable structures over partial alphabets $\Sigma$. Let $\Sigma_{c,u}$ be 
an alphabet with $c$ pairs of complementary bases and $u$ bases without a complementary base. 

Fig. 3. An example of undesignable (left) and designable 
structure (right). 

Fig. 4. Application of the structure-approximating algorithm to the non-designable structure S in Fig. 3. A base pair 
(circled black node) is inserted in the greedily colored tree, offsetting the levels of white and unpaired nodes (crosses) 
to even and odd levels respectively, so that the resulting tree is proper/separated, representing a designable structure. 


Designability over restricted alphabets. 

R1 For every $u \in \mathbb{N}^+$, Designable$(\Sigma_{0,u}) = \{(n,\emptyset) \mid \forall n \in \mathbb{N}\}$; 

R2 Designable$(\Sigma_{1,0}) = \{S \in \mathcal{S} \mid S \text{ is saturated and } D(S) \leq 2\} \cup \{(n, \emptyset) \mid \forall n \in \mathbb{N}\}$; 

R3 Designable$(\Sigma_{1,1}) = \{S \in \mathcal{S} \mid D(S) \leq 2\}$. 

Designability over the complete alphabet $\Sigma_{2,0}$ = {A, U, C, G}. 

R4 Designable$(\Sigma_{2,0}) \cap \{S \in \mathcal{S} \mid S \text{ is saturated}\} = \{S \in \mathcal{S} \mid D(S) \leq 4\} \cap \{S \in \mathcal{S} \mid S \text{ is saturated}\}$. 

When unpaired positions are allowed in the target structure, our characterization is only partial: 

R5 Let $m_5$ represent “a node having degree more than four”, and $m_{3\circ}$ be “a node having one or 
more unpaired children, and degree greater than two”, then 

Designable$(\Sigma_{2,0}) \cap \{S \in \mathcal{S} \mid S \text{ contains } m_5 \text{ or } m_{3\circ}\} = \emptyset$; 

R6 Let Sep be the set of structures for which there exists a separated (proper) coloring of the tree 
representation, then Sep $\subset$ Designable$(\Sigma_{2,0})$; 

R7 The set of $\Sigma_{2,0}$-designable structures is closed under the $k$-stutter operations: 

$\forall S \in \mathcal{S}, \forall k \in \mathbb{N}^+ : S \in \text{Designable}(\Sigma_{2,0}) \Rightarrow S^{[k]} \in \text{Designable}(\Sigma_{2,0})$. 

We note that $S^{[k]} \in$ Designable$(\Sigma_{2,0})$ does not imply that $S \in$ Designable$(\Sigma_{2,0})$. For instance, it 
can be verified that a stutter of the structure $S$ in Figure 3 is $\Sigma_{2,0}$-designable, while $S$ itself is not. Membership to the classes 
described in R1-R5 can be tested by trivial linear-time algorithms, which can also be adapted into 
linear-time algorithms for the RNA-DESIGN$_{\mathcal{W},\Sigma}$ problem. 
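
As an illustration of such a linear-time membership test, the sketch below scans a dot-parentheses string once and reports which of the two forbidden motifs of R5 occur. The function name `forbidden_motifs` and the labels "m5" and "m3o" are hypothetical, and the degree convention (the virtual root counts as a parent of top-level nodes) is an assumption of this sketch.

```python
def forbidden_motifs(db):
    """Return the subset of {"m5", "m3o"} found in a well-parenthesized string.

    m5 : a paired node (or the root) of degree greater than four.
    m3o: a paired node (or the root) with an unpaired child and degree greater than two.
    """
    found = set()

    def close(frame):
        paired_children, has_unpaired_child, is_root = frame
        degree = paired_children + (0 if is_root else 1)
        if degree > 4:
            found.add("m5")
        if has_unpaired_child and degree > 2:
            found.add("m3o")

    # One frame per open node: [paired children, has unpaired child, is root].
    stack = [[0, False, True]]
    for ch in db:
        if ch == "(":
            stack[-1][0] += 1
            stack.append([0, False, False])
        elif ch == ")":
            close(stack.pop())
        else:
            stack[-1][1] = True
    close(stack.pop())
    return found


print(forbidden_motifs("(().())"))  # {'m3o'}
```

An empty result only means that R5 does not rule the structure out; as discussed next, it does not by itself guarantee designability.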

Structure-approximating algorithm. Unfortunately, the absence of $m_5$ or $m_{3\circ}$, while necessary, is 
generally not sufficient to ensure designability. For instance, $S$ in Figure 3 clearly does not contain 
$m_5$ or $m_{3\circ}$, yet cannot be designed. In such cases, the unwanted interactions can be penalized by 
the duplication of some base pairs. For instance, duplicating the base pairs in the above example 
yields a $\Sigma_{2,0}$-designable structure. 

R8 Any structure $S$ without $m_5$ and $m_{3\circ}$ can be transformed in $O(n)$ time into a $\Sigma_{2,0}$-designable 
structure $S'$, by inflating a subset of its base pairs (at most one per band) so that the greedy 
coloring of the resulting structure is proper and separated, as illustrated by Figure 4. 

Fig. 5. Construction of the saturated structure compatible with the 
suffix v. The vertical line splits the sequence into a prefix u and a 
suffix v. Blue and red arcs depict saturated structures compatible 
with w and u respectively. Dashed red arcs represent the induced 
saturated structure compatible with v. 


4 Proofs 

R1 is trivial since, in the absence of complementary letters, the structures without base pairs are 
the only structures whose energy is not infinite. 

Theorem 1 (Result R4). A saturated sec. str. $S$ is $\Sigma_{c,0}$-designable if and only if $D(S) \leq 2c$. 

Proof. First, we will show that the degree condition is necessary. Assume to the contrary that 
$D(S) > 2c$ and $S$ has a design $w$. Let $[a, b]$ be a vertex with degree $d \geq 2c+1$ in $T_S$. Let $\{[l_i, r_i]\}_{i=1}^{d}$ 
be the (paired) children of $[a, b]$ and the node $[a, b]$ if $[a, b]$ is not the root. Let $L_i = l_i$ and $R_i = r_i$ if 
$[l_i, r_i]$ is a child of $[a, b]$, and $L_i = r_i$ and $R_i = l_i$ if it is $[a, b]$. Then among bases $w_{L_1}, \ldots, w_{L_d}$ there must 
be a pair of repeated letters. Let $w_{L_i} = w_{L_j}$ be such a pair with $L_i < L_j$. It is easy to check that 
$S \setminus \{(l_i, r_i), (l_j, r_j)\} \cup \{(L_i, R_j), (R_i, L_j)\}$ is a structure compatible with $w$ with the same number 
of base pairs as $S$, a contradiction with the assumption that $w$ is a design for $S$. 

To show that the degree condition is also sufficient, we need further definitions and claims. 
First, we say that a sequence $w \in \Sigma^*$ is saturable if there is a saturated structure compatible with 
w. Note that the concatenation of two saturable sequences is also saturable. Then the following 
claim characterizes the cases when a saturable sequence can be split into saturable sequences. 

Claim 1.1. Let w = uv be a saturable sequence of length k. If u is saturable, then so is v. 

Proof. Consider a saturated structure $S$ compatible with sequence $w$ and saturated structure $S_u$ 
compatible with $u$. We will construct a saturated structure compatible with $v$. 

Consider a graph $G$ with vertex set $\{1, \ldots, k\}$ and edge set defined by pairs in $S \cup S_u$. Obviously, 
this graph is a collection of alternating paths (alternating between pairs from $S$ and from $S_u$, starting 
and ending with positions in $v$) and alternating cyclic paths, and it has a planar embedding such 
that all vertices lie on a line in their order: pairs in $S$ are drawn as non-crossing arcs above the line 
and pairs in $S_u$ as non-crossing arcs below the line. Note that every position in $v$ is an end-point 
of exactly one path in the collection. 

Define a set $S_v$ of base pairs by pairing the end-points of the paths in $G$, cf. Figure 5. We will show 
that $S_v$ is a structure. Consider a graph $G'$ constructed by adding pairs in $S_v$ to $G$. This graph 
is a collection of cyclic paths. Consider an embedding of $G'$ into the plane that extends the planar 
embedding of $G$ by adding arcs corresponding to the pairs in $S_v$ below the line containing all the 
vertices. If two base pairs $b, b' \in S_v$ cross then the cyclic path containing $b$ and the cyclic path 
containing $b'$ intersect in exactly one point. By Jordan’s curve theorem, this is a contradiction. It 
follows that $S_v$ is a saturated structure, and hence $v$ is also saturable. □ 

We define $w$ to be an atomic saturable sequence if no proper prefix of $w$ is saturable. Clearly, 
every saturated structure compatible with an atomic saturable sequence $w$ contains the base pair 
$(1, |w|)$. On the other hand, by Claim 1.1, if every saturated structure compatible with $w$ contains 
the pair $(1, |w|)$, then $w$ is an atomic saturable sequence. A design $w$ that is also an atomic saturable 
sequence will be called an atomic saturable design. A concatenation of two or more atomic saturable 
designs is obviously not an atomic saturable sequence and it is not necessarily a design. However, 
we have the following claims. 




Claim 1.2. The concatenation of $t$ atomic saturable designs $w^1 \cdots w^t$ for structures $S^1, \ldots, S^t$ such 
that $w^i_1 \neq w^j_1$, $\forall 1 \leq i < j \leq t$, is a design for the concatenated (saturated) structure $S = S^1 \cdots S^t$. 

Proof. Assume that $W := w^1 \cdots w^t$ is not a design, then there exists a saturated structure $S' \neq S$ 
for $W$. We show that positing such an alternative structure leads to a contradiction, recalling 
that each $S^i$ is saturated and contains a pair $(1, |w^i|)$. Assume that there exists a leftmost word $w^i$, 
$i \in [1, t]$, such that $w^i_1$ is not paired with $w^i_{|w^i|}$ in $S'$. If $w^i_1$ is not paired, then $S'$ is not saturated, 
a contradiction. Let $w^j_k$, $j \geq i$, be the partner of $w^i_1$ in $S'$, and let $u := w^i \cdots w^{j-1} w^j_{[1,k]}$. If $k = |w^j|$, 
then $j > i$ and, by complementarity, $w^i_1 = w^j_1$, which contradicts the preconditions. Hence, we can 
assume that $k < |w^j|$. Since $u$ and each of the $w^i, \ldots, w^{j-1}$ are saturable, by iterated application of 
Claim 1.1 we conclude that $v = w^j_{[1,k]}$ is saturable as well and, from Claim 1.1, so is $v' = w^j_{[k+1,|w^j|]}$. 
This contradicts the hypothesis that $w^j = v.v'$ is an atomic saturable design, since there exists a 
saturated folding for $w^j$ which does not pair its extremities. Consequently, $S'$ pairs the first and 
last letters in each $w^i$, hence $S' = S$ since each $w^i$ is a design, again a contradiction. We conclude 
that no alternative saturated folding exists for $W$, i.e., $W$ is a design for $S$. □ 


Claim 1.3. Consider $t$ atomic saturable designs $w^1 = w^1_1 \cdots w^1_{|w^1|}$, $\ldots$, $w^t = w^t_1 \cdots w^t_{|w^t|}$ and a 
pair $a, b$ of complementary letters such that $w^i_1 \neq b$ for every $1 \leq i \leq t$ and $w^i_1 \neq w^j_1$ for every 
$1 \leq i < j \leq t$. Then $W = a w^1 \cdots w^t b$ is an atomic saturable design. 

Proof. We will first show that $W$ is an atomic saturable sequence. Assume to the contrary that 
there is a proper prefix of $W$ that is saturable. Consider the shortest such prefix $a w^1 \cdots w^i w^{i+1}_{[1,j]}$. 
Obviously, $a$ has to be paired with $w^{i+1}_j$, otherwise we can find a shorter saturable prefix. This 
implies that $b = w^{i+1}_j$ and that $w^1 \cdots w^i w^{i+1}_{[1,j-1]}$ is saturable as well. By repeated application of 
Claim 1.1, we have that $w^{i+1}_{[1,j-1]}$ is saturable. Since it is a prefix of the atomic saturable sequence 
$w^{i+1}$, it must be the empty sequence, i.e., $j = 1$. Therefore, $b = w^{i+1}_1$, a contradiction with the 
assumptions of the claim. Thus, $W$ is an atomic saturable sequence. 

Now we will show that $W$ is a design. Consider any MFE (saturated) structure $S$ for $W$. Since 
$W$ is atomic saturable, $a$ is paired with $b$ in $S$. By Claim 1.2, $w^1 \cdots w^t$ is a design. It follows that 
$W$ is a design as well. □ 


To prove the sufficiency of the degree condition, consider the following algorithm, which takes 
as input a saturated structure $S$ with $D(S) \leq 2c$, and returns a design $w$ for $S$: 

- Let $\{[l_i, r_i]\}_{i=1}^{d}$ be the children of the root. Assign to each $w_{l_i}, w_{r_i}$ complementary bases such 
that $\forall 1 \leq i < j \leq d : w_{l_i} \neq w_{l_j}$. 

- While there exists an unprocessed internal node $[a, b]$ whose parent has been processed (if there 
is no such node, stop and return $w$). Let $\{[l_i, r_i]\}_{i=1}^{d}$ be the children of $[a, b]$. Assign to each 
$w_{l_i}, w_{r_i}$ complementary bases such that $\forall 1 \leq i \leq d : w_{l_i} \neq w_b$ and $\forall 1 \leq i < j \leq d : w_{l_i} \neq w_{l_j}$, 
as required by the assumptions of Claim 1.3. 

Note that since the alphabet contains $c$ pairs of complementary bases, the assignment at each 
step of the algorithm is possible. We will show that the returned sequence $w$ is a design for $S$. We 
will show by tree induction on the size of subtrees that $w_i \cdots w_j$ is an atomic saturable design for 
every internal node $[i,j]$. It is easy to check that this is satisfied at the leaves. Consider an internal 
node $u$. By the induction hypothesis, sequences for each child subtree of $u$ are atomic saturable 
designs. Furthermore, by the choice of bases at children nodes of $u$, all assumptions of Claim 1.3 
are satisfied, hence, the sequence for node $u$ is also an atomic saturable design. The claim holds. 
Finally, we can apply Claim 1.2 at the root, which yields that $w$ is a design. □ 
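
The assignment procedure used in this proof can be sketched as a single left-to-right pass. The code below is an illustrative variant, not the authors' implementation: it visits nodes in depth-first order rather than with an explicit "unprocessed node" loop, picks the first admissible letter at each step, and uses a hypothetical function name `design_saturated`. The alphabet is passed as a string of letters ("GCAU" for c = 2, "AU" for c = 1).

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def design_saturated(db, alphabet="GCAU"):
    """Assign letters to a saturated dot-parentheses structure.

    Children of a node receive pairwise distinct opening letters, all different
    from the closing letter of their parent (no constraint at the root).
    Raises ValueError if the degree condition D(S) <= 2c is violated.
    """
    if "." in db:
        raise ValueError("structure must be saturated")
    w = [""] * len(db)
    opens = []                  # positions of currently open parentheses
    avail = [list(alphabet)]    # letters still usable by children of each open node
    for pos, ch in enumerate(db):
        if ch == "(":
            if not avail[-1]:
                raise ValueError("a node has too many paired children")
            letter = avail[-1].pop(0)
            w[pos] = letter
            opens.append(pos)
            closing = COMPLEMENT[letter]
            avail.append([x for x in alphabet if x != closing])
        elif ch == ")":
            if not opens:
                raise ValueError("unbalanced ')'")
            w[pos] = COMPLEMENT[w[opens.pop()]]
            avail.pop()
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if opens:
        raise ValueError("unbalanced '('")
    return "".join(w)


print(design_saturated("(()())()"))    # GGCAUCCG
print(design_saturated("(())", "AU"))  # AAUU
```

Each position is visited once and each step inspects at most 2c letters, so the pass runs in linear time for a fixed alphabet.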








Corollary 2 (Result R2). A structure $S$ is $\Sigma_{1,0}$-designable if and only if it does not contain any 
base pairs, or it is saturated and $D(S) \leq 2$. 

Proof. If $S$ contains a base pair and an unpaired position, then it can be easily checked that $S$ is not 
$\Sigma_{1,0}$-designable. Hence, any $\Sigma_{1,0}$-designable structure is either empty, and trivially designable using 
a single letter, or saturated. In the latter case, by Theorem 1, we know that designable structures 
are exactly those that are saturated, and such that $D(S) \leq 2$. The claim follows. □ 

Corollary 3 (Result R3). A structure $S$ is $\Sigma_{1,1}$-designable if and only if $D(S) \leq 2$. 

Proof. First, suppose $S$ is $\Sigma_{1,1}$-designable and let $w$ be a design for $S$. Then Paired$(w, S)$ is a design 
for Paired$(S)$. Since the paired restriction Paired$(S)$ is saturated, this design is over the alphabet $\Sigma_{1,0} \subset \Sigma_{1,1}$, 
and by Theorem 1, $D(\text{Paired}(S)) \leq 2$. Hence, $D(S) = D(\text{Paired}(S)) \leq 2$. 

Conversely, suppose that $D(S) \leq 2$. Construct a design for $S$ as follows. Since Paired$(S)$ is 
saturated, by Theorem 1 there is a design $w'$ for Paired$(S)$ over $\Sigma_{1,0} \subset \Sigma_{1,1}$. Construct $w$ from $w'$ 
by inserting the base without a complementary base at every unpaired position of $S$. Let $S'$ be an 
MFE structure for $w$. Obviously, all unpaired positions in $S$ are also unpaired in $S'$. We must have 
Paired$(S')$ = Paired$(S)$, otherwise we have an alternative structure for $w'$, a contradiction. Hence, 
$S' = S$, i.e., $w$ is a design for $S$. □ 

Result R4 follows readily from Theorem 1 by taking $c = 2$. 

Lemma 4 (Result R5). Any structure that contains $m_5$ or $m_{3\circ}$ is not $\Sigma_{2,0}$-designable. 

Proof. Assume that $S$ is $\Sigma_{2,0}$-designable and let $w$ be a design for $S$. Then Paired$(w, S)$ is a design 
for Paired$(S)$. Since Paired$(S)$ is saturated, by Theorem 1, $D(S) = D(\text{Paired}(S)) \leq 4$, hence, $S$ 
cannot contain motif $m_5$. Now, assume to the contrary that $S$ contains motif $m_{3\circ}$ appearing at node 
$[a, b]$ of $T_S$. Let $\{[l_i, r_i]\}_{i=1}^{3}$ be some paired children of $[a,b]$ and the node $[a, b]$ if $[a, b]$ is not the 
root, and $[u,u]$ an unpaired child of $[a,b]$. Let $L_i = l_i$ and $R_i = r_i$ if $[l_i,r_i]$ is a child of $[a,b]$, and 
$L_i = r_i$ and $R_i = l_i$ if it is $[a,b]$. If among bases $w_{L_1}, \ldots, w_{L_3}$ there is a pair of repeated letters, 
then we can construct an alternative MFE structure for $w$ (see the first paragraph in the proof of 
Theorem 1). Assume that these three bases are different. Then for some $i = 1, 2, 3$, $w_u$ equals either 
$w_{L_i}$ or $w_{R_i}$, say it equals $w_{L_i}$. Then $S \setminus \{(l_i, r_i)\} \cup \{(u, R_i)\}$ is an MFE structure for $w$, a contradiction 
with the assumption that $w$ is a design for $S$. □ 

Theorem 5 (Result R6). If the tree representation of a structure $S$ admits a separated coloring 
then $S$ is $\Sigma_{2,0}$-designable. 

Proof. Given a sequence $w$, we define the level $L(i)$ of position $i$ as $L(i) = |w_{[1,i]}|_G - |w_{[1,i]}|_C$. 

Claim 5.1. Consider any structure compatible with sequence $w$ that contains some A-U base pair 
between positions at different levels, then some G or C is left unpaired. 

Proof. Consider that the A-U base pair occurs at position $(a,b)$, and note that the bases of the 
substring $w_{[a+1,b-1]}$ can only base pair among themselves without introducing crossings. We will 
show that G’s and C’s are not balanced in this substring. Since $w_b \in \{A, U\}$, $L(b) = L(b-1)$. Hence, 
by the definition of $L$, we have that 

$$|w_{[a+1,b-1]}|_G - |w_{[a+1,b-1]}|_C = L(b-1) - L(a) = L(b) - L(a) \neq 0.$$ 

Therefore, at least one G or C in the substring remains unpaired in this structure. □ 
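
The level function used in this claim is a running difference between the numbers of G's and C's. A minimal Python sketch (the function name `levels` is hypothetical; positions are reported in order, so the first entry is the level of position 1):

```python
from itertools import accumulate


def levels(w):
    """Levels of positions 1..|w|: number of G's minus number of C's in each prefix."""
    return list(accumulate((ch == "G") - (ch == "C") for ch in w))


# In "AGU" the A (level 0) and the U (level 1) sit at different levels:
# pairing them encloses a single G, which is necessarily left unpaired.
print(levels("AGU"))   # [0, 1, 1]
print(levels("GAUC"))  # [1, 1, 1, 0]
```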


Consider a separated coloring of the tree representation of $S$. We will use this coloring to 
construct a design $w$ for $S$, by specifying a nucleotide at each position of $w$. First, for each unpaired 
position $i$, set $w_i$ = U. Second, apply a modified version of the algorithm described in Theorem 1 
to set the bases of paired positions, in which black nodes are assigned to base pair G-C, white 
nodes to C-G and grey nodes to A-U or U-A. The algorithm ignores unpaired nodes in the tree 
representation of $S$. Since the coloring is proper, such an assignment is always possible at every step of 
the algorithm. We claim that for any node $[i,j]$ (paired or unpaired), the level of position $i$ is the 
same as the level of the node $[i,j]$. To verify this, observe that the substring of $w$ corresponding to 
any subtree has the same number of G’s and C’s. Hence, for any node $[i,j]$, the level of position $i$ 
depends only on nodes on the path from this node to the root. It is easy to check that the level of 
$i$ is equal to the level of the node. Note that if $[i,j]$ is a grey node then the level of position $j$ is the 
same as the level of $i$, i.e., the same as the level of $[i,j]$. 

We will show that the constructed $w$ is a design for $S$. Since all C’s and A’s of $w$ are paired in $S$, 
$S$ is an MFE structure for $w$. We need to show that it is the only MFE structure for $w$. Consider an 
MFE structure $S'$ for $w$ different from $S$. Since $w$ has the same number of G’s and C’s, $S'$ must pair 
all G’s, C’s and A’s of $w$. We will show that all unpaired positions in $S$ are also unpaired in $S'$. 
Assume to the contrary that position $i$ is unpaired in $S$, but it is paired to $j$ in $S'$. We must have 
$w_i$ = U and $w_j$ = A. Since the coloring is separated, the unpaired node $[i, i]$ has a different level 
than the grey node containing $j$, and hence, the level of $i$ is different from the level of $j$. It follows 
by Claim 5.1 that some G or C is unpaired in $S'$, a contradiction. Consider paired restrictions of 
$S$, $S'$ and $w$. Both Paired$(S)$ and Paired$(S')$ are saturated and compatible with Paired$(w, S)$ and 
they are different since $S$ and $S'$ are different and agree on the unpaired positions. Furthermore, 
Paired$(w, S)$ can be produced by the algorithm described in Theorem 1 for the input structure 
Paired$(S)$, and hence, by Theorem 1, Paired$(w, S)$ is a design for Paired$(S)$, which contradicts the 
existence of Paired$(S')$. Hence, $w$ is a design for $S$. □ 
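
The construction in this proof can be sketched for one particular coloring, the greedy one. The code below is an illustration under stated assumptions rather than the authors' implementation: the function name `design_from_greedy_coloring` is hypothetical, only the greedy coloring is tried (the theorem applies to any separated coloring), and a grey child of a grey node is given the same orientation as its parent so that its opening letter differs from the parent's closing letter.

```python
def design_from_greedy_coloring(db):
    """Build a sequence for db from its greedy coloring.

    black -> G-C, white -> C-G, grey -> A-U or U-A, unpaired -> U.
    Raises ValueError if the greedy coloring is not proper or not separated.
    """
    w = [""] * len(db)
    grey_levels, unpaired_levels = set(), set()
    # One frame per open node: [color, level, paired children seen, grey children seen, letters]
    stack = [["root", 0, 0, 0, ""]]
    for pos, ch in enumerate(db):
        top = stack[-1]
        if ch == ".":
            w[pos] = "U"
            unpaired_levels.add(top[1])
        elif ch == "(":
            color, level, seen, greys, letters = top
            top[2] += 1
            if color in ("black", "white"):
                child = color if seen == 0 else "grey"
            else:  # root or grey parent
                child = ("black", "white")[seen] if seen < 2 else "grey"
            if child == "grey":
                top[3] += 1
                if greys >= (1 if color == "grey" else 2):
                    raise ValueError("greedy coloring is not proper")
                # A grey child of a grey node copies its parent's orientation.
                pair = letters if color == "grey" else ("AU", "UA")[greys]
                grey_levels.add(level)
            else:
                pair = "GC" if child == "black" else "CG"
            child_level = level + (child == "black") - (child == "white")
            w[pos] = pair[0]
            stack.append([child, child_level, 0, 0, pair])
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            w[pos] = stack.pop()[4][1]
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if grey_levels & unpaired_levels:
        raise ValueError("greedy coloring is not separated")
    return "".join(w)


print(design_from_greedy_coloring("()()((.))"))  # GCCGAGUCU
```

A `ValueError` here only says that the greedy coloring fails; it does not prove that the structure is undesignable.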


Theorem 6 (Result R7). If $w$ is a design for a structure $S$, then for any integer $k \geq 1$, $w^{[k]}$ is 
a design for $S^{[k]}$. In particular, if a structure $S$ is $\Sigma_{2,0}$-designable, then so is $S^{[k]}$. 

Proof. Consider a designable structure $S$ and let $w = w_1 \cdots w_n$ be a design for $S$. We will show 
that $w^{[k]}$ is a design for $S^{[k]}$. Let the $i$-th $k$ positions in $S^{[k]}$ be called the region $i$. Note that the 
positions in region $i$ of $S^{[k]}$ correspond to the $i$-th position in $S$. 

First, we will show that $S^{[k]}$ is an MFE structure for $w^{[k]}$. Consider an MFE structure $S'$ of $w^{[k]}$. 
Define an interaction graph of $S'$, denoted by $I(S') = (V_{I(S')}, E_{I(S')})$, as follows: the vertex set 
$V_{I(S')}$ is the set of positions in $w$, i.e., $\{1, \ldots, n\}$, and there is an edge between $i$ and $j$ in $I(S')$ if 
there exists a pair between a position in region $i$ and a position in region $j$ in $S'$. Note that $I(S')$ is 
a bipartite graph: indeed, vertices of any cycle in $I(S')$ are positions in $w$ that alternate between A 
and U, or between C and G. Also note that $I(S')$ is an outer-planar graph: base pairs are pairwise 
non-crossing and can therefore be drawn without crossings on the upper half-plane, leaving the 
lower half-plane on the outer face. Assign each edge $e = \{i,j\}$ in $E_{I(S')}$ a weight $c(e)$ equal to the number of 
pairs between regions $i$ and $j$ in $S'$. Note that the sum of all weights in $I(S')$, denoted as $\|E_{I(S')}\|$, 
equals $|S'|$. We have the following claim. 

Claim 6.1. If $M$ is a maximum matching in $I(S')$ then $|S'| \leq k|M|$. Moreover, if $|S'| = k|M|$ 
then every minimum vertex cover of $I(S')$ covers every edge exactly once. 


Proof. Note that for any vertex $i$ in $V_{I(S')}$, the sum of the weights of edges incident with $i$ is at 
most $k$. Consider a smallest vertex cover $C$ of $I(S')$, and take the sum of these inequalities over all 
vertices $i$ in the cover $C$: 

$$\sum_{i \in C} \ \sum_{e \text{ incident with } i} c(e) \leq k|C|. \qquad (1)$$ 

Since $C$ is a vertex cover, the weight of every edge in $E_{I(S')}$ appears at least once on the left side 
of (1), hence $|S'| = \|E_{I(S')}\| \leq k|C|$. By Kőnig’s Theorem, the maximum matching $M$ in $I(S')$ has 
as many edges as $C$ has vertices, i.e., $|S'| \leq k|M|$. The equality implies that the weight of every 
edge in $E_{I(S')}$ appears exactly once on the left side of (1), i.e., that vertex cover $C$ covers every 
edge exactly once. □ 


Given a matching $M$ in $I(S')$, we can construct a structure $S_M$ for $w$ with $|M|$ pairs as follows: 
for every edge $\{i,j\}$ in $M$, add pair $(i,j)$. This is a valid (pseudoknot-free) structure, since $M$ is 
a subgraph of the outer-planar graph $I(S')$. It follows that $|M| \leq |S|$. If $M$ is a maximum matching 
on $I(S')$, we have by Claim 6.1 that $|S'| \leq k|M| \leq k|S| = |S^{[k]}|$, i.e., $S^{[k]}$ is an MFE structure for 
$w^{[k]}$. It also follows that $|S'| = k|M|$ and that $|M| = |S|$. Since $S$ is the unique MFE structure for $w$ and 
$|S_M| = |M| = |S|$, we have that $S_M = S$, i.e., there is only one maximum matching in $I(S')$. We 
need the following claim to show that all connected components in $I(S')$ have at most 2 vertices. 


Claim 6.2. Let G be a connected bipartite graph on at least three vertices with unique maximum 
matching M. Then there exists a minimum vertex cover of G that covers some edge twice. 


Proof. First, we will show that every vertex in G is incident to an edge in matching M. Assume
the contrary and consider all vertices in G which are incident to only non-matching edges. If two of
these vertices are adjacent then the matching is not maximal. Otherwise, let u be such a vertex and
v its neighbor. Vertex v must be incident to a matching edge. We can construct a new matching of
the same size by removing this edge and adding edge uv, which contradicts the assumption that M
is a unique maximum matching.

Take a maximal path P alternating between matching and non-matching edges in G. Let u be
an endpoint of P and e the edge on P incident to u. If e is a non-matching edge then u must be
incident to a matching edge, say f. By maximality of P, the other endpoint v of f must be on P.
Since every internal vertex of P is incident to a matching edge on P, v must be the other endpoint
of P and the edge of P incident to v must be a non-matching edge. Hence, we have an alternating cycle
P + f which contradicts the uniqueness of the maximum matching. Thus, P starts and ends with
matching edges. Next, we show that u is a pendant vertex (has degree one). Assume to the contrary
that u is incident to another (non-matching) edge f = uv. By maximality of P, v is on P, which yields
a cycle. If this cycle is even, we have an alternating cycle, which contradicts the uniqueness of the
matching, and if it is odd, we have a contradiction with the fact that G is bipartite. Hence, both
endpoints of P are pendant.

Consider a minimum vertex cover C of G. By the well-known König's theorem, every minimum
vertex cover in a bipartite graph uses exactly one endpoint of every edge of a maximum matching
and no other vertices. Since the endpoints of P are pendant, and G is connected and has ≥ 3 vertices,
P must have at least three edges. Since the endpoints of P are pendant and incident to matching edges,
we can assume that C does not contain the endpoints of P, i.e., it contains the second and the
second-to-last vertex of P. It is easy to see that at least one non-matching edge is covered twice by C. □
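Claim 6.2 can be sanity-checked exhaustively on small graphs. The following sketch enumerates all maximum matchings and all minimum vertex covers of a graph given as an edge list; the helper names and the two test graphs (a path on four vertices and a single edge) are assumptions chosen for illustration. It is a brute-force check, not the argument used in the proof.

```python
from itertools import combinations


def all_maximum_matchings(edges):
    """List every matching of maximum size (edges are pairs of distinct vertices)."""
    edges = list(edges)
    for size in range(len(edges), 0, -1):
        found = [set(c) for c in combinations(edges, size)
                 if len({x for e in c for x in e}) == 2 * size]
        if found:
            return found
    return [set()]


def all_minimum_vertex_covers(vertices, edges):
    """List every vertex cover of minimum size."""
    for size in range(len(vertices) + 1):
        found = [set(c) for c in combinations(vertices, size)
                 if all(u in c or v in c for u, v in edges)]
        if found:
            return found
    return []


def some_cover_hits_edge_twice(vertices, edges):
    """True if some minimum vertex cover contains both endpoints of an edge."""
    return any(u in cover and v in cover
               for cover in all_minimum_vertex_covers(vertices, edges)
               for u, v in edges)


# Hypothetical example: the path 0-1-2-3 is connected, bipartite, and has
# a unique maximum matching, so the claim predicts a doubly covered edge.
path = [(0, 1), (1, 2), (2, 3)]
unique = len(all_maximum_matchings(path)) == 1
twice = some_cover_hits_edge_twice([0, 1, 2, 3], path)
```

For the path, the cover {1, 2} contains both endpoints of the non-matching edge (1, 2). For a single edge, no minimum cover does, which matches the two-vertex components allowed by the argument that follows.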


Consider a connected component K of I(S'). Since I(S') has a unique maximum matching, so does
K. If K has more than two vertices then, by Claim 6.2, there is a minimum vertex cover of K that
covers some edge twice. It follows that there is a minimum vertex cover of I(S') that covers some
edge twice. Hence, by Claim 6.1, |S'| < k|M|, a contradiction. It follows that every connected
component of I(S') has at most two vertices, hence, either S' is not MFE or S' = S^{[k]}. □




Theorem 7 (Result R8). Each structure S without m_5 and m_{3∘} can be transformed into a
Σ_{2,0}-designable structure S' by inflating a subset of its base pairs (at most one per band).
Furthermore, this transformation can be done in O(n) time.

Proof. We start with the greedy coloring of T_S. Since S does not contain m_5 and m_{3∘}, it is a proper
coloring and there is no node having both a grey child and an unpaired child. We will insert base 
pairs within S so that the grey nodes and any unpaired node end up at levels of different parities. 
If the root has a grey child, assign even parity to the grey nodes, otherwise (if the root has an 
unpaired child, or no grey and no unpaired children), assign even parity to the unpaired nodes. 

Now we proceed from the children of the root towards the leaves, adjusting the parity level of grey
and unpaired nodes to keep one type even and the other one odd. We repeatedly apply the following
simple operation on T_S whenever a node N does not match its intended parity level. Denote by N_p
the parent of N (N_p is not the root as all children of the root already have the correct parity level)
and by N_pp the parent of N_p. Insert a new paired node N_N between N_pp and N_p, assign it the color
of N_p, and apply the greedy algorithm on N_N. Observe that N_p always takes either black or white
color, changing the parity level of all its descendants (including N). Note that the children of N_p
may get recolored, and we can even get one more grey child, but after this operation the parity levels
of all children of N are correct and we do not change parity levels outside the subtree rooted at N.
After fixing all nodes, we get a separated proper coloring (which is actually the greedy coloring) of
T_S'. Hence, by Theorem S' is designable. Figure illustrates this process. □
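The effect of the insertion step on parity levels can be illustrated on a bare tree. The sketch below is a simplification under stated assumptions: it models only the tree operation (inserting one node between a grandparent and a parent, which corresponds to adding one base pair), and it deliberately omits the greedy coloring and recoloring used in the proof. The dictionary representation and the names `depths` and `insert_above` are hypothetical.

```python
def depths(children, root):
    """Map every node reachable from root to its depth (root has depth 0)."""
    out = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            out[child] = out[node] + 1
            stack.append(child)
    return out


def insert_above(children, grandparent, parent, new_node):
    """Insert new_node between grandparent and parent, modifying children in place."""
    siblings = children[grandparent]
    siblings[siblings.index(parent)] = new_node
    children[new_node] = [parent]


# Hypothetical tree: node "n" sits at odd depth 3 but is intended to be even.
tree = {"root": ["a", "b"], "a": ["c"], "c": ["n"]}
before = depths(tree, "root")
insert_above(tree, grandparent="a", parent="c", new_node="x")
after = depths(tree, "root")
```

After the insertion, every node in the subtree below the new node is one level deeper, so its parity flips, while nodes outside that subtree (such as "b") keep their depth. This is the locality property the proof relies on when processing the tree from the root towards the leaves.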

5 Conclusion, discussion and perspectives 

In this work, we introduced the Combinatorial RNA Design problem, a minimal instance of the 
RNA design problem which aims at finding a sequence that admits the target structure as its unique 
base pair maximizing structure. First, we provided complete characterizations for the structures 
that can be designed using restricted alphabets. Then we considered RNA design under a
four-letter alphabet, and provided a complete characterization of designable saturated structures, i.e.,
free of unpaired positions. Turning to those target structures that contain unpaired positions, we
provided partial characterizations for classes of designable/undesignable structures, and showed
that the set of designable structures is closed under the stutter operation. Finally, we introduced a
structure-approximating version of the problem and, assuming that the input structure avoids two
motifs, provided a structure-approximating algorithm of ratio 2 for general structures.

An important question that is left open by this work is the computational complexity of the
RNA design problem. Schnall-Levin et al. established the NP-hardness of a more general problem,
called the inverse Viterbi algorithm, which takes as input a stochastic grammar (representing
the energy model) and a targeted parse tree (representing the structure), and outputs a sequence
(design) whose most probable parsing should match the target. However, this result does not settle
the complexity of RNA design, essentially because the proposed reduction relies critically on
an encoding of 3-SAT instances within the input grammar. While the hypothetical perfect
grammar/energy model for RNA folding probably differs from the currently accepted Turner model, it
should ultimately reflect the laws of physics and should certainly not depend on the instance. As
the reduction of Schnall-Levin et al. requires a different grammar (i.e., energy model) for each
instance, it does not seem easily adaptable into a proof that holds for a fixed energy model.
Consequently, despite two decades of work on the subject, the computational tractability of RNA
design is still open, both in its general form and in our combinatorial version.

Besides complexity issues, natural extensions of this work may include the consideration of more
general base pairing models, more realistic energy models (ideally, the Turner energy model),
or the design under other objectives, such as the Boltzmann probability [22]. However, even the
simplest of modifications, allowing G-U base pairs, would invalidate parity properties that are
critical to the proofs of some of our results and algorithms. More precise bounds for the ratio of the
structure-approximating algorithm could be established. Finally, better algorithms could be designed
for the problem, attempting to minimize the number of modifications so that a given structure becomes
designable (or, more modestly, belongs to an identified class of designable structures).